{
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "colab": {
      "provenance": [],
      "authorship_tag": "ABX9TyNtGurXJvdCqbMNGCM35Inr",
      "include_colab_link": true
    },
    "kernelspec": {
      "name": "python3",
      "display_name": "Python 3"
    },
    "language_info": {
      "name": "python"
    }
  },
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "view-in-github",
        "colab_type": "text"
      },
      "source": [
        "<a href=\"https://colab.research.google.com/github/tomasonjo/blogs/blob/master/llm/Neo4jOpenAIApoc.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 1,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "EehUBcly6lRg",
        "outputId": "a764076b-f9dc-4439-f521-c6fb1d533135"
      },
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/\n",
            "Requirement already satisfied: neo4j in /usr/local/lib/python3.10/dist-packages (5.9.0)\n",
            "Requirement already satisfied: pytz in /usr/local/lib/python3.10/dist-packages (from neo4j) (2022.7.1)\n"
          ]
        }
      ],
      "source": [
        "!pip install neo4j"
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "# Integrate LLM workflows with Knowledge Graph using Neo4j and APOC\n",
        "## OpenAI and VertexAI endpoints are now available as APOC Extended procedures\n",
        "Probably a day doesn't go by that you don't hear about new and exciting things happening in the Large Language Model (LLM) space. There are so many opportunities and use cases for any company to utilize the power of LLMs to enhance their productivity, transform or manipulate their data, and be used in conversational AI and QA systems.\n",
        "To make it easier for you to integrate LLMs with Knowledge Graphs, the team at Neo4j has begun the journey of adding support for LLM integrations. The integrations are available as APOC Extended procedures. At the moment, OpenAI and VertexAI endpoints are supported, but we plan to add support for many more.\n",
        "When I was brainstorming what would be a cool use case to demonstrate the newly added APOC procedures, my friend Michael Hunger suggested an exciting idea. What if we used graph context, or the neighborhood of a node, to enrich the information stored in text embeddings? That way, the vector similarity search could produce better results due to the increased richness of embedded information. The idea is simple but compelling and could be helpful in many use cases.\n",
        "\n",
        "## Neo4j environment setup\n",
        "In this example, we will use both the APOC and Graph Data Science libraries. Luckily, Neo4j Sandbox projects have both libraries installed and additionally come with a prepopulated database. Therefore, you can set up the environment with a couple of clicks. We will use [the small Movie project](https://sandbox.neo4j.com/?usecase=movies) to avoid incurring a more considerable LLM API cost."
      ],
      "metadata": {
        "id": "Dz238wQH1UEd"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "# Define Neo4j connections\n",
        "from neo4j import GraphDatabase\n",
        "host = 'bolt://44.215.124.63:7687'\n",
        "user = 'neo4j'\n",
        "password = 'steel-orders-reproduction'\n",
        "driver = GraphDatabase.driver(host,auth=(user, password))\n",
        "\n",
        "def run_query(query, params={}):\n",
        "    with driver.session() as session:\n",
        "        result = session.run(query, params)\n",
        "        return result.to_df()"
      ],
      "metadata": {
        "id": "QM4VE2Q_6n0D"
      },
      "execution_count": 2,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "![1_FdRXxPhWJSaOmDIeHLoWGw.png]()\n",
        "\n",
        "The dataset contains Movie and Person nodes. There are only 38 movies, so we are dealing with a tiny dataset. The information provides a movie's title and tagline, when it was released, and who acted in or directed it.\n",
        "## Constructing text embedding values\n",
        "\n",
        "We will be using the OpenAI API endpoints. Therefore, you will end to create an OpenAI account if you haven't already.\n",
        "\n",
        "As mentioned, the idea is to use the neighborhood of a node to construct its text embedding representation. Since the graph model is simple, we don't have a lot of creative freedom. We will create text embedding representations of movies by using their properties and neighbor information. In this instance, the neighbor information is only about its actors and directors. However, I believe that this concept can be applied to more complex graph schema and be used to improve your vector similarity search applications.\n",
        "\n"
      ],
      "metadata": {
        "id": "CgJdlP841fj4"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "![1_HiCYExxmRoJ4K2KolV2B1w.png]()\n",
        "\n",
        "The typical approach we see nowadays, where we simply chunk and embed documents, might fail when looking for information that spans multiple documents. This problem is also known as multi-hop question answering. However, the multi-hop QA problem can be solved using knowledge graphs. One way to look at a knowledge graph is as condensed information storage. For example, an information extraction pipeline can be used to extract relevant information from various records. Using knowledge graphs, you can represent highly-connected information that spans multiple documents as relationships between various entities.\n",
        "\n",
        "One solution is to use LLMs to generate a Cypher statement that can be used to retrieve connected information from the database. Another solution, which we will use here, is to use the connection information to enrich the text embedding representations. Additionally, the enhanced information can be retrieved at query time to provide additional context to the LLM from which it can base its response.\n",
        "\n",
        "The following Cypher query can be used to retrieve all the relevant information about the movie nodes from their neighbors."
      ],
      "metadata": {
        "id": "yz60G2IR1sX3"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "print(run_query(\"\"\"\n",
        "MATCH (m:Movie)\n",
        "MATCH (m)-[r:ACTED_IN|DIRECTED]-(t)\n",
        "WITH m, type(r) as type, collect(t.name) as names\n",
        "WITH m, type+\": \"+reduce(s=\"\", n IN names | s + n + \", \") as types\n",
        "WITH m, collect(types) as contexts\n",
        "WITH m, \"Movie title: \"+ m.title + \" year: \"+coalesce(m.released,\"\") +\" plot: \"+ coalesce(m.tagline,\"\")+\"\\n\" +\n",
        "       reduce(s=\"\", c in contexts | s + substring(c, 0, size(c)-2) +\"\\n\") as context\n",
        "RETURN context LIMIT 1\n",
        "\"\"\")['context'][0])"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "M7_NaY4S7HTa",
        "outputId": "962b21fb-0e90-4799-ad6d-8d2715c11524"
      },
      "execution_count": 3,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Movie title: The Matrix year: 1999 plot: Welcome to the Real World\n",
            "ACTED_IN: Emil Eifrem, Hugo Weaving, Laurence Fishburne, Carrie-Anne Moss, Keanu Reeves\n",
            "DIRECTED: Lana Wachowski, Lilly Wachowski\n",
            "\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "Depending on your domain, you might also use custom queries that retrieve information more than one hop away or sometimes want to aggregate some results.\n",
        "\n",
        "We will now use OpenAI's embedding endpoint to generate text embeddings representing the movies and their context and store them as node properties."
      ],
      "metadata": {
        "id": "qsjWWgtP10Dg"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "openai_api_key = \"OPENAI_API_KEY\""
      ],
      "metadata": {
        "id": "HQuFAs7d_Yl6"
      },
      "execution_count": 4,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "run_query(\"\"\"\n",
        "CALL apoc.periodic.iterate(\n",
        "  'MATCH (m:Movie) RETURN id(m) AS id',\n",
        "  'MATCH (m:Movie)\n",
        "   WHERE id(m) = id\n",
        "   MATCH (m)-[r:ACTED_IN|DIRECTED]-(t)\n",
        "   WITH m, type(r) as type, collect(t.name) as names\n",
        "   WITH m, type+\": \"+reduce(s=\"\", n IN names | s + n + \", \") as types\n",
        "   WITH m, collect(types) as contexts\n",
        "   WITH m, \"Movie title: \"+ m.title + \" year: \"+coalesce(m.released,\"\") +\" plot: \"+ coalesce(m.tagline,\"\")+\"\\n\" +\n",
        "        reduce(s=\"\", c in contexts | s + substring(c, 0, size(c)-2) +\"\\n\") as context\n",
        "   CALL apoc.ml.openai.embedding([context], $apiKey) YIELD embedding\n",
        "   SET m.embedding = embedding',\n",
        "  {batchSize:1, retries:3, params: {apiKey: $apiKey}})\n",
        "\"\"\", {'apiKey': openai_api_key})['errorMessages'][0]"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "wzeKmQnT9Mqv",
        "outputId": "8ee5f12c-828d-4cb9-aa35-d6be9a832e47"
      },
      "execution_count": 5,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "{}"
            ]
          },
          "metadata": {},
          "execution_count": 5
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "The newly added apoc.ml.openai.embeddingprocedures make generating text embeddings very easy using OpenAI's API. We wrap the API call with apoc.periodic.iterate to batch the transactions and introduce the retry policy.\n",
        "\n",
        "# Retrieval-augmented LLMs\n",
        "\n",
        "It looks like the mainstream trend is to provide LLMs with external information at query time. We can even find OpenAI's guides how to provide relevant information as part of the prompt to generate the answer.\n",
        "\n",
        "![1_zydD2GKzjpEyvL-d_cP0vA.png]()\n",
        "\n",
        "Here, we will use vector similarity search to find relevant movies given the user input. The workflow is the following:\n",
        "We embed the user question with the same text embedding model we used to embed node context information\n",
        "We use the cosine similarity to find the top 3 most relevant nodes and return their information to the LLM\n",
        "The LLM constructs the final answer based on the provided information\n",
        "\n",
        "Since we will be using the gpt-3.5-turbo model to generate the final answer, it is a good practice to define the system prompt. To make it more readable, we will define the system prompt as Python variable and then use query parameters when executing Cypher statements.\n"
      ],
      "metadata": {
        "id": "h6FbeBKO12H4"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "system_prompt = \"\"\"\n",
        "You are an assistant that helps to generate text to form nice and human understandable answers based.\n",
        "The latest prompt contains the information, and you need to generate a human readable response based on the given information.\n",
        "Make the answer sound as a response to the question. Do not mention that you based the result on the given information.\n",
        "Do not add any additional information that is not explicitly provided in the latest prompt.\n",
        "I repeat, do not add any information that is not explicitly given.\n",
        "\"\"\""
      ],
      "metadata": {
        "id": "KR8qsmX72AQ_"
      },
      "execution_count": 6,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "Next, we will define a function that constructs a user prompt based on the user question and the provided context from the database."
      ],
      "metadata": {
        "id": "vK2Xj7ky2BGr"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "def generate_user_prompt(question, context):\n",
        "    return f\"\"\"\n",
        "   The question is {question}\n",
        "   Answer the question by using the provided information:\n",
        "   {context}\n",
        "   \"\"\""
      ],
      "metadata": {
        "id": "tGHG1dOi2DlI"
      },
      "execution_count": 7,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "Before asking the LLM to generate answers, we must define the intelligent search tool that will provide relevant context information based on the vector similarity search. As mentioned, we need to embed the user input and then use the cosine similarity to identify relevant nodes. With graphs, you can decide the type of information you want to retrieve and provide as context. In this example, we will return the same context information that was used to generate text embeddings along with similar movie information."
      ],
      "metadata": {
        "id": "GXrNjboh2EWH"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "def retrieve_context(question, k=3):\n",
        "    data = run_query(\n",
        "        \"\"\"\n",
        "    // retrieve the embedding of the question\n",
        "    CALL apoc.ml.openai.embedding([$question], $apiKey) YIELD embedding\n",
        "    // match relevant movies\n",
        "    MATCH (m:Movie)\n",
        "    WITH m, gds.similarity.cosine(embedding, m.embedding) AS score\n",
        "    ORDER BY score DESC\n",
        "    // limit the number of relevant documents\n",
        "    LIMIT toInteger($k)\n",
        "    // retrieve graph context\n",
        "    MATCH (m)--()--(m1:Movie)\n",
        "    WITH m,m1, count(*) AS count\n",
        "    ORDER BY count DESC\n",
        "    WITH m, apoc.text.join(collect(m1.title)[..3], \", \") AS similarMovies\n",
        "    MATCH (m)-[r:ACTED_IN|DIRECTED]-(t)\n",
        "    WITH m, similarMovies, type(r) as type, collect(t.name) as names\n",
        "    WITH m, similarMovies, type+\": \"+reduce(s=\"\", n IN names | s + n + \", \") as types\n",
        "    WITH m, similarMovies, collect(types) as contexts\n",
        "    WITH m, \"Movie title: \"+ m.title + \" year: \"+coalesce(m.released,\"\") +\" plot: \"+ coalesce(m.tagline,\"\")+\"\\n\" +\n",
        "          reduce(s=\"\", c in contexts | s + substring(c, 0, size(c)-2) +\"\\n\") + \"similar movies:\" + similarMovies + \"\\n\" as context\n",
        "    RETURN context\n",
        "  \"\"\",\n",
        "        {\"question\": question, \"k\": k, \"apiKey\": openai_api_key},\n",
        "    )\n",
        "    return data[\"context\"].to_list()"
      ],
      "metadata": {
        "id": "E4EGxJuLAm1y"
      },
      "execution_count": 8,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "At the moment, you need to use the gds.similarity.cosine function to calculate the cosine similarity between the question and relevant nodes. After identifying the relevant nodes, we retrieve the context using two additional MATCHclauses. You can check out Neo4j's GraphAcademy to learn more about Cypher query language.\n",
        "\n",
        "Finally, we can define the function that takes in the user question and returns an answer."
      ],
      "metadata": {
        "id": "PlPZeC3k2JML"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "def generate_answer(question):\n",
        "    # Retrieve context\n",
        "    context = retrieve_context(question)\n",
        "    # Print context\n",
        "    for c in context:\n",
        "        print(c)\n",
        "    # Generate answer\n",
        "    response = run_query(\n",
        "        \"\"\"\n",
        "  CALL apoc.ml.openai.chat([{role:'system', content: $system},\n",
        "                      {role: 'user', content: $user}], $apiKey) YIELD value\n",
        "  RETURN value.choices[0].message.content AS answer\n",
        "  \"\"\",\n",
        "        {\n",
        "            \"system\": system_prompt,\n",
        "            \"user\": generate_user_prompt(question, context),\n",
        "            \"apiKey\": openai_api_key,\n",
        "        },\n",
        "    )\n",
        "    return response[\"answer\"][0]"
      ],
      "metadata": {
        "id": "AWL1XQh62Jax"
      },
      "execution_count": 9,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "Let's test our retrieval-augmented LLM workflow."
      ],
      "metadata": {
        "id": "pfUcyFnG2NW9"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "generate_answer(\"Who played in the Matrix?\")"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 313
        },
        "id": "MSiHagee_70-",
        "outputId": "398d15be-da08-4a6f-bcb5-d00b9dd1a0f1"
      },
      "execution_count": 10,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Movie title: The Matrix year: 1999 plot: Welcome to the Real World\n",
            "ACTED_IN: Emil Eifrem, Hugo Weaving, Laurence Fishburne, Carrie-Anne Moss, Keanu Reeves\n",
            "DIRECTED: Lana Wachowski, Lilly Wachowski\n",
            "similar movies:The Matrix Revolutions, The Matrix Reloaded, V for Vendetta\n",
            "\n",
            "Movie title: The Matrix Reloaded year: 2003 plot: Free your mind\n",
            "DIRECTED: Lana Wachowski, Lilly Wachowski\n",
            "ACTED_IN: Hugo Weaving, Laurence Fishburne, Carrie-Anne Moss, Keanu Reeves\n",
            "similar movies:The Matrix Revolutions, The Matrix, V for Vendetta\n",
            "\n",
            "Movie title: The Matrix Revolutions year: 2003 plot: Everything that has a beginning has an end\n",
            "DIRECTED: Lana Wachowski, Lilly Wachowski\n",
            "ACTED_IN: Hugo Weaving, Laurence Fishburne, Carrie-Anne Moss, Keanu Reeves\n",
            "similar movies:The Matrix Reloaded, The Matrix, V for Vendetta\n",
            "\n"
          ]
        },
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "'The actors who played in The Matrix are Emil Eifrem, Hugo Weaving, Laurence Fishburne, Carrie-Anne Moss and Keanu Reeves.'"
            ],
            "application/vnd.google.colaboratory.intrinsic+json": {
              "type": "string"
            }
          },
          "metadata": {},
          "execution_count": 10
        }
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "generate_answer(\"Recommend a movie with Jack Nicholson?\")"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 331
        },
        "id": "XepSm2miD7ac",
        "outputId": "b86d6d0c-cea7-49bc-95d6-299c4fbd84ec"
      },
      "execution_count": 11,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Movie title: Something's Gotta Give year: 2003 plot: \n",
            "ACTED_IN: Keanu Reeves, Diane Keaton, Jack Nicholson\n",
            "DIRECTED: Nancy Meyers\n",
            "similar movies:Something's Gotta Give, The Replacements, Johnny Mnemonic\n",
            "\n",
            "Movie title: One Flew Over the Cuckoo's Nest year: 1975 plot: If he's crazy, what does that make you?\n",
            "ACTED_IN: Danny DeVito, Jack Nicholson\n",
            "DIRECTED: Milos Forman\n",
            "similar movies:Hoffa, As Good as It Gets, Something's Gotta Give\n",
            "\n",
            "Movie title: As Good as It Gets year: 1997 plot: A comedy from the heart that goes for the throat.\n",
            "ACTED_IN: Helen Hunt, Jack Nicholson, Cuba Gooding Jr., Greg Kinnear\n",
            "DIRECTED: James L. Brooks\n",
            "similar movies:A Few Good Men, Cast Away, Twister\n",
            "\n"
          ]
        },
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "'If you\\'re looking for a movie recommendation featuring Jack Nicholson, I\\'d suggest checking out \"One Flew Over the Cuckoo\\'s Nest\" from 1975. The movie stars Danny DeVito and Jack Nicholson, and was directed by Milos Forman. It\\'s a classic drama that portrays the struggles of patients in a mental institution.'"
            ],
            "application/vnd.google.colaboratory.intrinsic+json": {
              "type": "string"
            }
          },
          "metadata": {},
          "execution_count": 11
        }
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "generate_answer(\"What are similar movies to As Good as It Gets?\")"
      ],
      "metadata": {
        "id": "DtWRYPGnEgv9",
        "outputId": "33ab60fd-4fdc-4adf-df15-ef8d8ae0e80e",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 331
        }
      },
      "execution_count": 13,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Movie title: Something's Gotta Give year: 2003 plot: \n",
            "ACTED_IN: Keanu Reeves, Diane Keaton, Jack Nicholson\n",
            "DIRECTED: Nancy Meyers\n",
            "similar movies:Something's Gotta Give, The Replacements, Johnny Mnemonic\n",
            "\n",
            "Movie title: As Good as It Gets year: 1997 plot: A comedy from the heart that goes for the throat.\n",
            "ACTED_IN: Helen Hunt, Jack Nicholson, Cuba Gooding Jr., Greg Kinnear\n",
            "DIRECTED: James L. Brooks\n",
            "similar movies:A Few Good Men, Cast Away, Twister\n",
            "\n",
            "Movie title: The Devil's Advocate year: 1997 plot: Evil has its winning ways\n",
            "DIRECTED: Taylor Hackford\n",
            "ACTED_IN: Al Pacino, Charlize Theron, Keanu Reeves\n",
            "similar movies:That Thing You Do, Something's Gotta Give, The Replacements\n",
            "\n"
          ]
        },
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "'There are a few similar movies to \"As Good as It Gets\" that you may enjoy. If you liked the combination of comedy and drama in the plot, you may also enjoy \"A Few Good Men\" and \"Cast Away\". If you enjoyed the acting of Jack Nicholson, you might also like \"The Replacements\" and \"Johnny Mnemonic\", both of which he had a role in.'"
            ],
            "application/vnd.google.colaboratory.intrinsic+json": {
              "type": "string"
            }
          },
          "metadata": {},
          "execution_count": 13
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "## Summary\n",
        "And there you have it, a glimpse into the fascinating world of integrating Large Language Models with Knowledge Graphs. As the field continues to evolve, so too will the tools and techniques at our disposal. With Neo4j and APOC's continued advancements, we can expect even greater innovation in how we handle and process data."
      ],
      "metadata": {
        "id": "uvT3ajut2Q9z"
      }
    }
  ]
}